video and audio
UniForm: A Unified Diffusion Transformer for Audio-Video Generation
Zhao, Lei, Feng, Linfeng, Ge, Dongxu, Yi, Fangqiu, Zhang, Chi, Zhang, Xiao-Lei, Li, Xuelong
As naturally multimodal content, audible video delivers an immersive sensory experience, so audio-video generation systems hold substantial potential. However, existing diffusion-based studies mainly employ relatively independent modules to generate each modality, leaving shared-weight generative modules underexplored. This approach may underuse the intrinsic correlations between the audio and visual modalities, potentially resulting in sub-optimal generation quality. To address this, we propose UniForm, a unified diffusion transformer designed to enhance cross-modal consistency. By concatenating auditory and visual information, UniForm learns to generate audio and video simultaneously within a unified latent space, facilitating the creation of high-quality, well-aligned audio-visual pairs. Extensive experiments demonstrate the superior performance of our method on joint audio-video generation, audio-guided video generation, and video-guided audio generation tasks. Our demos are available at https://uniform-t2av.github.io/.
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Sensing and Signal Processing > Image Processing (0.93)
- Information Technology > Artificial Intelligence > Natural Language (0.68)
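The core idea in the abstract above, generating both modalities with one set of shared weights by concatenating their latents into a single sequence, can be sketched in a few lines. This is a minimal toy illustration, not UniForm's implementation: the shapes, the random modality embeddings, and the single weight matrix standing in for the diffusion transformer are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy latent shapes (hypothetical): 16 video tokens and 8 audio tokens, dim 32.
D = 32
video_latents = rng.standard_normal((16, D))
audio_latents = rng.standard_normal((8, D))

# Learned modality embeddings (random stand-ins here) tell the shared
# backbone which tokens belong to which modality.
video_emb = rng.standard_normal(D)
audio_emb = rng.standard_normal(D)

# Concatenate both modalities into one sequence for a single shared model.
tokens = np.concatenate([video_latents + video_emb,
                         audio_latents + audio_emb], axis=0)

# Stand-in for the shared diffusion transformer: one weight matrix applied
# to every token, so audio and video are processed by the same parameters.
W = rng.standard_normal((D, D)) * 0.05
denoised = tokens @ W

# Split the joint output back into per-modality predictions.
video_out, audio_out = denoised[:16], denoised[16:]
print(video_out.shape, audio_out.shape)  # (16, 32) (8, 32)
```

The point of the concatenation is that every denoising step attends jointly over both modalities, so cross-modal correlations are captured by the shared weights rather than by separate per-modality modules.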
A Simple but Strong Baseline for Sounding Video Generation: Effective Adaptation of Audio and Video Diffusion Models for Joint Generation
Ishii, Masato, Hayakawa, Akio, Shibuya, Takashi, Mitsufuji, Yuki
In this work, we build a simple but strong baseline for sounding video generation. Given base diffusion models for audio and video, we integrate them, with additional modules, into a single model and train it to jointly generate audio and video. To enhance alignment between audio-video pairs, we introduce two novel mechanisms in our model. The first is timestep adjustment, which provides different timestep information to each base model; it is designed to align how samples evolve along the timesteps across modalities. The second is a new design for the additional modules, termed Cross-Modal Conditioning as Positional Encoding (CMC-PE). In CMC-PE, cross-modal information is embedded as if it represented temporal position information, and the embeddings are fed into the model like a positional encoding. Compared with the popular cross-attention mechanism, CMC-PE provides a better inductive bias for temporal alignment in the generated data. Experimental results validate the effectiveness of the two newly introduced mechanisms and demonstrate that our method outperforms existing methods. Diffusion models have made great strides in recent years on generation tasks across modalities, including image, video, and audio (Yang et al., 2023).
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
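The CMC-PE idea described above, injecting cross-modal conditioning additively at matching temporal positions rather than via cross-attention, can be illustrated with a toy sketch. Everything here is an assumption for demonstration (the frame counts, the simple average-pooling alignment, the additive injection); it is not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

D = 32
audio_frames = rng.standard_normal((40, D))   # 40 audio feature frames
video_tokens = rng.standard_normal((10, D))   # 10 video frame tokens

# Align audio to the video frame rate: average every 4 audio frames so that
# audio feature i corresponds temporally to video token i.
aligned = audio_frames.reshape(10, 4, D).mean(axis=1)

# Inject the cross-modal features additively at matching positions, the way
# a positional encoding is added -- the PE-like step that gives the
# temporal-alignment inductive bias.
conditioned = video_tokens + aligned
print(conditioned.shape)  # (10, 32)
```

Contrast this with cross-attention, where every video token can attend to every audio frame: the position-wise additive injection hard-wires "audio at time t conditions video at time t," which is the inductive bias the abstract credits for better temporal alignment.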
Meta plans to ramp up labeling of AI-generated images across its platforms
Meta plans to ramp up its labeling of AI-generated images across Facebook, Instagram and Threads to help make it clear that the visuals are artificial. It's part of a broader push to tamp down misinformation and disinformation, which is particularly significant as we wrangle with the ramifications of generative AI (GAI) in a major election year in the US and other countries. According to Meta's president of global affairs, Nick Clegg, the company has been working with partners from across the industry to develop standards that include signifiers that an image, video or audio clip has been generated using AI. "Being able to detect these signals will make it possible for us to label AI-generated images that users post to Facebook, Instagram and Threads," Clegg wrote in a Meta Newsroom post. "We're building this capability now, and in the coming months we'll start applying labels in all languages supported by each app."
- Media > News (0.91)
- Government > Regional Government > North America Government > United States Government (0.36)
CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling
Yang, Ruihan, Gamper, Hannes, Braun, Sebastian
We introduce a multi-modal diffusion model tailored for the bi-directional conditional generation of video and audio. Recognizing the importance of accurate alignment between video and audio events in multi-modal generation tasks, we propose a joint contrastive training loss to enhance the synchronization between visual and auditory occurrences. We conduct comprehensive experiments on multiple datasets to thoroughly evaluate the efficacy of our proposed model, assessing generation quality and alignment performance with both objective and subjective metrics. Our findings demonstrate that the proposed model outperforms the baseline, substantiating its effectiveness and efficiency. Notably, the incorporation of the contrastive loss improves audio-visual alignment, particularly in the high-correlation video-to-audio generation task. These results indicate the potential of our proposed model as a robust solution for improving the quality and alignment of multi-modal generation, thereby contributing to the advancement of video and audio conditional generation systems.
- North America > United States > California > Orange County > Irvine (0.14)
- North America > United States > Washington > King County > Redmond (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- (4 more...)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Vision (0.95)
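A contrastive alignment loss of the kind the CMMD abstract describes is typically an InfoNCE-style objective: embeddings of matched video-audio pairs are pulled together while mismatched pairs within the batch are pushed apart. The sketch below shows such a generic loss; the shapes, temperature, and batch construction are illustrative assumptions, not the paper's exact objective.

```python
import numpy as np

rng = np.random.default_rng(2)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def contrastive_loss(video_emb, audio_emb, temperature=0.1):
    """InfoNCE-style loss: row i of each matrix is one clip's embedding,
    so the diagonal of the similarity matrix holds the matched pairs."""
    v = l2_normalize(video_emb)          # (B, D)
    a = l2_normalize(audio_emb)          # (B, D)
    logits = v @ a.T / temperature       # (B, B) pairwise similarities
    labels = np.arange(len(v))           # diagonal = matched pairs
    # Cross-entropy over rows (video -> audio direction).
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[labels, labels].mean()

B, D = 4, 16
v = rng.standard_normal((B, D))
loss_mismatched = contrastive_loss(v, rng.standard_normal((B, D)))
loss_matched = contrastive_loss(v, v + 0.01 * rng.standard_normal((B, D)))
print(loss_matched < loss_mismatched)  # matched pairs should yield the lower loss
```

Minimizing this loss during diffusion training encourages the model's audio and video representations of the same moment to coincide, which is the mechanism the abstract credits for improved synchronization.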
We Haven't Seen the Worst of Fake News
It was 2018, and the world as we knew it--or rather, how we knew it--teetered on a precipice. Against a rising drone of misinformation, The New York Times, the BBC, Good Morning America, and just about everyone else sounded the alarm over a new strain of fake but highly realistic videos. Using artificial intelligence, bad actors could manipulate someone's voice and face in recorded footage almost like a virtual puppet and pass the product off as real. In a famous example engineered by BuzzFeed, Barack Obama seemed to say, "President Trump is a total and complete dipshit." Synthetic photos, audio, and videos, collectively dubbed "deepfakes," threatened to destabilize society and push us into a full-blown "infocalypse."
- Asia > North Korea (0.29)
- Asia > Russia (0.14)
- North America > United States > New York (0.04)
- (5 more...)
- Media > News (1.00)
- Information Technology > Security & Privacy (1.00)
- Government > Regional Government > North America Government > United States Government (0.87)
Are these guys for real? How to keep your business safe from deepfakes
Is that really Tom Cruise about to wrestle an alligator? Keanu Reeves dancing like nobody is watching? Deepfake technology is advanced artificial intelligence that replaces actual video and audio with video and audio that was artificially created from other sources. While it may look like harmless fun on TikTok, it's also becoming a huge security risk for businesses of all sizes. According to a just released report from the cloud service firm VMware, deepfake attacks are on the rise.
- Europe > United Kingdom (0.06)
- Asia > China > Hong Kong (0.06)
Death, resurrection and digital immortality in an AI world
I have been thinking about death lately, possibly because I recently had a month-long bout of Covid-19, and because I read a recent story about the passing of the actor Ed Asner, famous for his role as Lou Grant in "The Mary Tyler Moore Show."
The impact of deepfakes: How do you know when a video is real?
In a world where seeing is increasingly no longer believing, experts are warning that society must take a multi-pronged approach to combat the potential harms of computer-generated media. As Bill Whitaker reports this week on 60 Minutes, artificial intelligence can manipulate faces and voices to make it look like someone said something they never said. The result is videos of things that never happened, called "deepfakes." Often, they look so real, people watching can't tell. Even Justin Bieber has been tricked by a series of deepfake videos on the social media video platform TikTok that appeared to be of Tom Cruise.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.05)
- North America > United States > Illinois > Cook County > Chicago (0.05)
- North America > United States > California (0.05)
- Information Technology > Security & Privacy (1.00)
- Media > News (0.71)
- Law (0.71)